Bag of What? Simple Noun Phrase Extraction for Text Analysis

Authors

  • Abram Handler
  • Matthew Denny
  • Hanna M. Wallach
  • Brendan T. O'Connor
Abstract

Social scientists who do not have specialized natural language processing training often use a unigram bag-of-words (BOW) representation when analyzing text corpora. We offer a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST uses a part-of-speech tagger and a finite state transducer to extract multiword phrases to be added to a unigram BOW. We compare NPFST to both n-gram and parsing methods in terms of yield, recall, and efficiency. We then demonstrate how to use NPFST for exploratory analyses; it performs well, without configuration, on many different kinds of English text. Finally, we present a case study using NPFST to analyze a new corpus of U.S. congressional bills. For our open-source implementation, see http://slanglab.cs.umass.edu/phrases/.
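The general idea of enriching a unigram BOW with phrases matched by a part-of-speech pattern can be sketched as follows. This is a minimal illustration, not the NPFST implementation: the tag names, the pattern (any run of adjectives and nouns ending in a noun), the `max_len` cap, and the function names `extract_phrases` and `enriched_bow` are all assumptions made for the example, and the input is assumed to be already tagged. A regular expression over one-character tag codes stands in for the finite state transducer.

```python
import re
from collections import Counter

# Hypothetical coarse tag set; each tag maps to a one-character code.
# Any tag not listed here is treated as "other" (O).
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}

# Assumed pattern for illustration: zero or more adjectives or nouns,
# followed by a noun. This is not the grammar used by NPFST.
NP_PATTERN = re.compile(r"[AN]*N")


def extract_phrases(tagged, max_len=4):
    """Return every span of 2..max_len tokens whose tags match NP_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for start in range(len(tagged)):
        last_end = min(start + max_len, len(tagged))
        for end in range(start + 2, last_end + 1):
            if NP_PATTERN.fullmatch(codes[start:end]):
                words = (word.lower() for word, _ in tagged[start:end])
                phrases.append(" ".join(words))
    return phrases


def enriched_bow(tagged, max_len=4):
    """Count unigrams, then add the extracted multiword phrases."""
    bow = Counter(word.lower() for word, _ in tagged)
    bow.update(extract_phrases(tagged, max_len))
    return bow


# A hand-tagged example sentence (the tags are supplied, not predicted).
tagged_sentence = [
    ("The", "DET"), ("new", "ADJ"), ("health", "NOUN"),
    ("care", "NOUN"), ("bill", "NOUN"), ("passed", "VERB"),
]
bow = enriched_bow(tagged_sentence)
```

Because every matching span is kept, nested phrases such as "health care" and "health care bill" both enter the BOW alongside the unigrams; in practice the tags would come from a part-of-speech tagger rather than being written by hand.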

Similar resources

Noun-Phrase Analysis in Unrestricted Text for Information Retrieval

Information retrieval is an important application area of natural-language processing where one encounters the genuine challenge of processing large quantities of unrestricted natural-language text. This paper reports on the application of a few simple, yet robust and efficient noun-phrase analysis techniques to create better indexing phrases for information retrieval. In particular, we describe...

Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information

In this paper, we investigate the impact of filtering candidate terms using linguistic information on the accuracy of automatic keyphrase extraction from scientific papers. According to linguistic knowledge, noun phrases are the most likely to be keyphrases. However, the definition of a noun phrase can vary from one system to another. We have identified five POS tag sequence definitions of a noun p...

An Endogeneous Corpus-Based Method for Structural Noun Phrase Disambiguation

In this paper, we describe a method for structural noun phrase disambiguation which mainly relies on the examination of the text corpus under analysis and doesn't need to integrate any domain-dependent lexico- or syntactico-semantic information. This method is implemented in the Terminology Extraction Software LEXTER. We first explain why the integration of LEXTER in the LEXTER-K project, which ai...

A Noun Phrase Parser of English

A noun phrase parser is useful for several purposes, e.g. for index term generation in an information retrieval application; for the extraction of collocational knowledge from large corpora for the development of computational tools for language analysis; for providing a shallow but accurately analysed input for a more ambitious parsing system; for the discovery of translation units, and so on....

Extracting Noun Phrases in Subject and Object Roles for Exploring Text Semantics

In tune with the recent developments in the automatic retrieval of text semantics, this paper is an attempt to extract one of the most fundamental semantic units from natural language text. The context is intuitively extracted from typed dependency structures basically depicting dependency relations instead of Part-Of-Speech tagged representation of the text. The dependency relations imply deep...

Publication date: 2016